The Grammar of Graphics


ggplot2 is one of the core packages of the tidyverse.
It offers a more flexible and versatile way to build graphs than the base R graphics functions.

The “gg” stands for “Grammar of Graphics”, a book by Leland Wilkinson that offers tools to concisely describe the components of a graphic.

The logic of ggplot2 stems from this idea: you can build every graph from the same few components, namely a data set, visual marks (geoms) representing the data, and a coordinate system.

As Hadley Wickham explained: “You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.”

Grammatical elements of ggplot2


A key feature of ggplot2 is that it allows you to layer graphical elements on top of each other, creating elaborate visualizations. A plot is assembled from the following elements:

  • Data - the data frame we want to use for our plot
  • Aesthetics (aes) - the scales we want to map our data onto
  • Geometries (geom) - the geometrical shapes representing our data
  • Themes - the appearance of the non-data aspects of the plot
  • Statistics - statistical representations of the data
  • Coordinates/Scales - the range and limits of our plot
  • Facets - the layout of multiple plots and subplots

The first three elements (data, aesthetics (aes), and geometries (geom)) are the basic elements.
We must define them in our ggplot code in order to produce a meaningful plot.

The remaining elements are “optional”, that is, they are set to a default. This means we are not required to define them when we plot, though typically we would want to adjust them.

In this presentation I will focus mainly on the first three and the most commonly used geoms.

Let’s get to work!

Installing packages


Begin by installing and loading the tidyverse package, which includes ggplot2 along with other useful packages such as dplyr and tidyr, which are used for manipulating data prior to plotting.

You only need to install the package once, but you will need to “load” it every time you restart a session.

If you only want to install the ggplot2 package, you can use a similar line of code. However, you will most likely use dplyr as well, so you may as well install tidyverse, which includes both (and more).
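A minimal sketch of the install-and-load step described above. The install lines are commented out so the block can be re-run safely; uncomment the one you need the first time.

```r
# Install once per machine (uncomment on first use).
# install.packages("tidyverse")   # ggplot2, dplyr, tidyr and more
# install.packages("ggplot2")     # ggplot2 only

# Load at the start of every R session.
library(tidyverse)
```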

Our Data


For this exercise we will use the diamonds dataset, which ships with the ggplot2 package, and the gapminder dataset from the gapminder package. Both are available in R.


We will start working with the diamonds data.

The first step should always be to examine the dataset. What variables do we have? What data type is each variable? How many observations are included?

Printing the dataset shows its first rows. You can also use the structure function str(), or the summary function summary() if you want more details on each variable.
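The inspection calls would look roughly like this (a sketch; loading ggplot2 alone is enough to make diamonds available):

```r
library(ggplot2)   # the diamonds dataset ships with ggplot2

diamonds           # print the dataset (a tibble)
str(diamonds)      # structure: the type of every variable
summary(diamonds)  # summary statistics for every variable
```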

## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

If you only want to know the variable names, you can simply list the columns of the dataset using colnames().
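For example:

```r
library(ggplot2)

colnames(diamonds)  # character vector of the column names
```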

##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"

Now let’s continue exploring the data by plotting it with ggplot2.

The ggplot2 syntax


The first line of code in ggplot2 requires us to input the data frame we intend to use and the aesthetics we want to map our data onto. This line typically includes all the data needed for creating the plot. The function syntax is written as: ggplot(data, aes())

For instance, to plot the price of diamonds based on their carat we need to set “diamonds” as the data, and map “carat” and “price” onto the x and y aesthetics.

The function can be written either as: ggplot(data = diamonds, aes(x = carat, y = price))
or simply as: ggplot(diamonds, aes(carat, price))

This creates the base layer of our plot, which includes the dimensions we defined for the aesthetics. In order to present the observations, we need to add geometric layers. For every layer we add, we need to place a “+” sign.

For instance, to present a trend line of the average price by carat, we can add a geom_smooth() layer. This geom draws a smoothed fit line with a confidence interval.
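A sketch of the base layer and the trend line built on top of it. The object names base and trend are only for illustration.

```r
library(ggplot2)

# Base layer: data plus aesthetic mappings, no geoms yet (draws empty axes).
base <- ggplot(data = diamonds, aes(x = carat, y = price))
base

# Add a smoothed trend line with its confidence band as a new layer.
trend <- base + geom_smooth()
trend
```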

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

However, a trend line alone does not tell us much about the individual observations. In this instance, it would make more sense to create a scatterplot in order to see the spread of the observations. We can do this by adding a geom_point() layer.
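The scatterplot version might look like this:

```r
library(ggplot2)

# One point per diamond: carat on the x axis, price on the y axis.
scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()
scatter
```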

Scatterplots

Many of the observations overlap, making it difficult to see the actual distribution. To help remedy overplotting, we can make the points more transparent by reducing alpha, and also increase the size of the points, inside the geom_point() layer.
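A sketch of these adjustments. The values 0.2 and 2 are assumptions chosen for illustration, not necessarily the ones used for the original figure.

```r
library(ggplot2)

# alpha and size are set outside aes() because they are fixed values,
# not mappings from a variable.
faded <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.2, size = 2)
faded
```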

This looks better, but it is still difficult to draw insights from this plot. We can add another aesthetic mapping to differentiate between diamonds with different cuts. In this example we will map “cut” onto the color aesthetic in the ggplot() line.
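Mapping cut onto color in the main ggplot() call could be written as follows (the alpha and size values are again illustrative assumptions):

```r
library(ggplot2)

# color = cut goes inside aes() because it maps a variable to an aesthetic.
by_cut <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.2, size = 2)
by_cut
```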

As mentioned previously, ggplot2 enables us to add multiple geom layers on top of each other. Each new geom layer will appear on top of the previous layers. And don’t forget to add another “+” sign.
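For instance, a smoothed line can be layered over the points. Because color = cut is defined in the ggplot() line in this sketch, geom_smooth() draws one line per cut.

```r
library(ggplot2)

layered <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.2) +  # first layer: the points
  geom_smooth()              # second layer: drawn on top of the points
layered
```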

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The aesthetics defined in the first line are automatically adopted by all the geom layers. Aesthetics defined in an individual geom layer affect only that geom layer, and can override aesthetic mappings from the main ggplot() line.
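A sketch of the difference: here color = cut is mapped only inside geom_point(), so the points are colored by cut while geom_smooth() inherits just x and y and draws a single line. The fixed color "black" is an illustrative choice.

```r
library(ggplot2)

single_line <- ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(color = cut), alpha = 0.2) +  # color applies to this layer only
  geom_smooth(color = "black")                 # one line for all diamonds
single_line
```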

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

We can also add multiple geom layers of the same type.
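As an illustration (an assumed example, not necessarily the code behind the original figure), two geom_smooth() layers can be combined: the default smoother and a straight linear fit.

```r
library(ggplot2)

two_smooths <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.1) +
  geom_smooth() +                                    # default smoother
  geom_smooth(method = "lm", color = "red",
              linetype = "dashed")                   # linear fit
two_smooths
```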

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 38 rows containing missing values (geom_smooth).

Each geom type has multiple arguments which are set to default values, and we can easily change them based on our needs. For instance, geom_point() understands the x, y, alpha, color, fill, shape, size, and stroke aesthetics. In the previous examples we changed the alpha and size of the points.

For an overview of ggplot2 geom arguments, see the ggplot2 cheat sheet published by RStudio.

Improving the plot

Before we continue, let’s make our lives a bit easier. Instead of typing the function over and over again, we can assign the plot to an object and simply add layers to that object.

Now we can add layers and adjustments to “dd”, which already contains our predefined ggplot() + geom_point().
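A sketch of this pattern. The exact contents of dd are an assumption here: a scatterplot of price by carat with the points colored by cut.

```r
library(ggplot2)

# Store the plot in an object...
dd <- ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(color = cut), alpha = 0.2)

# ...then keep adding layers to it without retyping the base plot.
dd_smooth <- dd + geom_smooth(color = "black")
dd_smooth
```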

Vertical lines

We can add lines to indicate the median and mean of carat. To add vertical and horizontal lines we use geom_vline() and geom_hline(), respectively.

We can also add labels to the lines with geom_text() to indicate what they represent.
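A sketch of the reference lines and their labels. The definition of dd, the label positions, and the helper data frame line_labels are assumptions for illustration. Passing a small data frame to geom_text() draws each label once rather than once per diamond.

```r
library(ggplot2)

dd <- ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(color = cut), alpha = 0.2)

carat_median <- median(diamonds$carat)
carat_mean   <- mean(diamonds$carat)

# One row per label; column names match the x and y mappings of dd.
line_labels <- data.frame(
  carat = c(carat_median, carat_mean),
  price = c(17500, 15500),
  label = c("median", "mean")
)

marked <- dd +
  geom_vline(xintercept = carat_median, linetype = "dashed") +
  geom_vline(xintercept = carat_mean) +
  geom_text(data = line_labels, aes(label = label), hjust = -0.1)
marked
```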

Even though we improved the plot, we can see that much of the data is concentrated on the left side of the plot. We can handle this by adjusting the data or, better yet, adjusting the scale.

Adjusting the data

Using dplyr functions, we can filter out observations greater than 3 carats. We’ll create a new dataset by saving the filtered data into an object called “smallD”.

We then plot the same aesthetics using the new data frame “smallD”.
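A sketch of the filtering step with dplyr, followed by the same scatterplot on the smaller data frame (the point styling is an illustrative assumption):

```r
library(ggplot2)
library(dplyr)

# Keep diamonds of 3 carats or less, dropping the ones above 3 carats.
smallD <- diamonds %>%
  filter(carat <= 3)

small_plot <- ggplot(smallD, aes(carat, price)) +
  geom_point(aes(color = cut), alpha = 0.2)
small_plot
```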

Adjusting the scales

Instead of filtering out extreme observations, we can adjust the x axis, either by changing its limits with xlim(), or by log-transforming the x scale with scale_x_log10().

Limiting the scale removes the points outside the limit range.
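For example, assuming limits of 0 to 3 carats and the same dd object as before:

```r
library(ggplot2)

dd <- ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(color = cut), alpha = 0.2)

# Points with carat outside [0, 3] are dropped, and ggplot2 warns about it.
limited <- dd + xlim(0, 3)
limited
```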

## Warning: Removed 32 rows containing missing values (geom_point).

Limiting the x scale for the diamonds dataset creates a graph that is identical to the one we created with the smallD data frame.


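Log-transforming the scale keeps all the data points, but spaces the axis multiplicatively, so that equal distances along the axis represent equal ratios rather than equal differences.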
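A sketch of the log-scaled version, again assuming the dd object defined earlier:

```r
library(ggplot2)

dd <- ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(color = cut), alpha = 0.2)

# All points are kept; only the spacing of the x axis changes.
logged <- dd + scale_x_log10()
logged
```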

Log-transforming is useful when the data is very skewed, as in the case of the gapminder data. But for the diamonds dataset, I would probably choose to limit the axis scale.

Facets & Themes


We can further examine differences by arranging the data into subplots with facet_grid() and facet_wrap().
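A sketch of both faceting functions. The choice of faceting variables and of theme_minimal() is an assumption for illustration.

```r
library(ggplot2)

dd <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.2)

# facet_wrap(): one panel per level of a single variable, wrapped into rows.
wrapped <- dd + facet_wrap(~ cut)
wrapped

# facet_grid(): a grid of panels, rows by one variable and columns by another.
# theme_minimal() changes the non-data appearance of the plot.
gridded <- dd + facet_grid(color ~ cut) + theme_minimal()
gridded
```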

Bar Charts

In geom_bar(), the height of each bar represents the number of cases in each group, so it only takes an “x” aesthetic.

In geom_col(), the height of each bar represents a value in the data, which is why it also requires a “y” aesthetic.
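A sketch contrasting the two geoms. The summary table price_by_cut and its mean_price column are hypothetical names introduced for this example.

```r
library(ggplot2)
library(dplyr)

# geom_bar() counts the rows in each group for us.
counts_plot <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()
counts_plot

# geom_col() plots values already present in the data,
# here the mean price per cut computed beforehand.
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(mean_price = mean(price))

means_plot <- ggplot(price_by_cut, aes(x = cut, y = mean_price)) +
  geom_col()
means_plot
```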

geom_bar()

Assigning the color aesthetic changes the color of the outlines rather than the fill of the bars. To change the color of the bars themselves, we use the fill aesthetic.

Assigning the fill to another variable splits each bar into subgroups.
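A sketch of the three cases. The specific colors and the choice of the color variable for the x axis are illustrative assumptions.

```r
library(ggplot2)

# color changes only the outline of the bars.
outlined <- ggplot(diamonds, aes(x = cut)) +
  geom_bar(color = "blue")

# fill changes the inside of the bars.
filled_bars <- ggplot(diamonds, aes(x = cut)) +
  geom_bar(fill = "blue")

# Mapping fill to a variable splits each bar into subgroups.
split_bars <- ggplot(diamonds, aes(x = color, fill = cut)) +
  geom_bar()

outlined
filled_bars
split_bars
```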

The default position is set to “stack”, which is why the cut levels are stacked upon each other. The other options are position = “fill”, which stretches each bar to represent 100%, and position = “dodge”, which places the groups next to each other.
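The three position options can be compared side by side (a sketch; the x variable is an assumption):

```r
library(ggplot2)

bar_base <- ggplot(diamonds, aes(x = color, fill = cut))

stacked <- bar_base + geom_bar(position = "stack")  # default: counts stacked
filled  <- bar_base + geom_bar(position = "fill")   # each bar scaled to 100%
dodged  <- bar_base + geom_bar(position = "dodge")  # groups side by side

stacked
filled
dodged
```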

Finally, you can also change the direction of the bars by flipping them 90 degrees with coord_flip(), or arrange them around a circular center with coord_polar().
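A sketch of both coordinate changes applied to the same bar chart:

```r
library(ggplot2)

bars <- ggplot(diamonds, aes(x = cut, fill = cut)) +
  geom_bar()

flipped <- bars + coord_flip()   # horizontal bars
polar   <- bars + coord_polar()  # bars drawn around a circle

flipped
polar
```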

Line graphs

Line graphs produced by geom_line() are suitable for longitudinal data in which we want to show change over time, or between different treatments. For the diamonds data, a line graph would look like a hot mess.

To demonstrate the line geom, we will switch to the gapminder data, which contains information on the life expectancy of countries at different points in time.

The Gapminder data
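As with diamonds, we can inspect the dataset first. A sketch, assuming the gapminder package is installed:

```r
library(gapminder)

str(gapminder)       # structure of the dataset
colnames(gapminder)  # variable names
```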


## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

I created a new data frame by grouping by year and continent, and adding new variables for the total population and the average life expectancy.
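A sketch of how such a summary could be computed with dplyr. The object name by_continent is hypothetical; the column names follow the output shown.

```r
library(dplyr)
library(gapminder)

by_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarise(
    totalPop       = sum(as.numeric(pop)),  # as.numeric avoids integer overflow
    AverageLifeExp = mean(lifeExp),
    .groups = "drop_last"                   # result stays grouped by year
  )
by_continent
```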

## # A tibble: 60 x 4
## # Groups:   year [12]
##     year continent   totalPop AverageLifeExp
##    <int> <fct>          <dbl>          <dbl>
##  1  1952 Africa     237640501           39.1
##  2  1952 Americas   345152446           53.3
##  3  1952 Asia      1395357351           46.3
##  4  1952 Europe     418120846           64.4
##  5  1952 Oceania     10686006           69.3
##  6  1957 Africa     264837738           41.3
##  7  1957 Americas   386953916           56.0
##  8  1957 Asia      1562780599           49.3
##  9  1957 Europe     437890351           66.7
## 10  1957 Oceania     11941976           70.3
## # … with 50 more rows
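With the data summarised this way, a line graph of average life expectancy over time might look like the sketch below. The summary is recomputed so the block runs on its own, and by_continent is a hypothetical object name.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

by_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarise(AverageLifeExp = mean(lifeExp), .groups = "drop")

# One line per continent, showing change over time.
life_lines <- ggplot(by_continent,
                     aes(x = year, y = AverageLifeExp, color = continent)) +
  geom_line()
life_lines
```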

Reminder: log-transforming a scale helps when the data is very skewed. So let’s put everything together.
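One possible way to combine the pieces covered so far (an illustrative sketch, not necessarily the original final plot): points colored by continent, a log-scaled x axis for the skewed GDP values, a fitted line, facets by year, and a theme.

```r
library(ggplot2)
library(gapminder)

final_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent), alpha = 0.4) +  # points colored by continent
  geom_smooth(method = "lm", color = "black") +      # linear fit per panel
  scale_x_log10() +                                  # log scale for skewed GDP
  facet_wrap(~ year) +                               # one panel per year
  labs(x = "GDP per capita (log scale)",
       y = "Life expectancy",
       color = "Continent") +
  theme_minimal()
final_plot
```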

Thank you!


Tutorials

Continue learning and practicing ggplot2 on your own:


  1. Data Visualization - in R for Data Science - Hadley Wickham’s e-book
  2. The Complete ggplot2 Tutorial - by Selva Prabhakaran
  3. Stack Overflow - Great for asking questions from the community
  4. Data Camp course - first lesson of each course is free
  5. Interactive charts - convert your ggplot2 figures into interactive ones powered by plotly.js